Analyzing CogSci proceedings authors


In [ ]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

Load the papers, and include only talks and posters:


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

papers = pd.read_csv("cogsci_proceedings.csv", encoding='utf-8', index_col=0)
papers = papers[papers['type'].isin(['talk', 'poster'])]
papers.head(3)

Let's now create a graph of author connectivity -- one link between each pair of authors that has published together:


In [ ]:
import graph
G = graph.make_author_graph(papers)
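
The `graph` module is local to this analysis and is not shown here. As a rough sketch of what a function like `make_author_graph` could do, the example below assumes a table with one row per (paper, author) pair. The function name `build_coauthor_graph`, the `paper` column, and the toy data are all hypothetical and only illustrate the idea of linking every pair of coauthors.

```python
from itertools import combinations

import networkx as nx
import pandas as pd


def build_coauthor_graph(rows, paper_col="paper", author_col="author"):
    """Hypothetical stand-in for graph.make_author_graph (not the original code).

    Assumes `rows` is a DataFrame with one row per (paper, author) pair.
    """
    coauthor_graph = nx.Graph()
    for _, group in rows.groupby(paper_col):
        names = sorted(set(group[author_col]))
        # Add nodes first so that single-author papers still appear in the graph.
        coauthor_graph.add_nodes_from(names)
        # One undirected edge per pair of authors on the same paper.
        coauthor_graph.add_edges_from(combinations(names, 2))
    return coauthor_graph


# Toy data (made up): three papers, five authors.
toy = pd.DataFrame({
    "paper": ["p1", "p1", "p1", "p2", "p2", "p3"],
    "author": ["Ann", "Bob", "Cy", "Bob", "Dee", "Eve"],
})
toy_graph = build_coauthor_graph(toy)
print(toy_graph.number_of_nodes(), toy_graph.number_of_edges())
```

Because the graph is undirected and unweighted, two authors who wrote several papers together still share a single edge, which matches the "one link per pair" description above.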

Given the graph, we can now do some analysis. Let's first look at the number of connected components -- these are groups of researchers who have only published with each other and not with the rest of the CogSci community:


In [ ]:
import networkx as nx

In [ ]:
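# Sort components from largest to smallest; connected_components guarantees no particular order
subgraphs = sorted(nx.connected_components(G), key=len, reverse=True)
subgraph_sizes = np.array([len(x) for x in subgraphs])
print(subgraph_sizes[:5])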
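
To make the idea of a connected component concrete, here is a small sketch on a made-up graph (the node names and the `toy_` variables are illustrative, not part of the CogSci data). It shows one large group, one isolated pair, and one author with no coauthors at all.

```python
import networkx as nx

toy_graph = nx.Graph()
toy_graph.add_edges_from([("A", "B"), ("B", "C"), ("C", "D")])  # one group of four
toy_graph.add_edge("E", "F")  # an isolated pair of coauthors
toy_graph.add_node("G")  # an author who never published with anyone else

# Components come back in no guaranteed size order, so sort them explicitly.
toy_components = sorted(nx.connected_components(toy_graph), key=len, reverse=True)
toy_sizes = [len(component) for component in toy_components]
print(toy_sizes)  # [4, 2, 1]
```

Note that A and D end up in the same component even though they never published together: they are linked through B and C.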

It looks like most authors are connected to each other, directly or through chains of coauthors, but a few groups are isolated. Looking at the size distribution of these isolated groups (every component except the largest), we see that they consist mostly of authors who have published only once or twice with a small number of other authors:


In [ ]:
component_sizes = sorted((len(x) for x in nx.connected_components(G)), reverse=True)
plt.hist(component_sizes[1:], bins=30)  # skip the largest component
plt.title("Histogram of connected component sizes")

Let's look more closely at individual authors. We can construct a new dataframe that gives us information on each author, including the number of papers they have authored, their number of coauthors, and their pagerank (a measure of how central they are in the coauthorship graph):


In [ ]:
authors = papers.groupby('author').size()
authors.name = 'papers'
authors = authors.to_frame()
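# dict(...) turns the degree view into an author -> degree mapping that pandas can align on
authors['coauthors'] = pd.Series(dict(G.degree()))
authors['pagerank'] = pd.Series(nx.pagerank(G))
authors.sort_values('pagerank').tail(10)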
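
For readers unfamiliar with pagerank, the following is a minimal power-iteration sketch in plain Python. It is not networkx's implementation; the function `pagerank_sketch` and the toy adjacency data are hypothetical, and the sketch assumes every node has at least one neighbour. The damping factor of 0.85 is chosen to match the default `alpha` of `nx.pagerank`.

```python
def pagerank_sketch(adjacency, damping=0.85, iterations=100):
    """Power-iteration pagerank on an undirected graph given as {node: set_of_neighbours}.

    Illustrative only; assumes every node has at least one neighbour.
    """
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbour splits its current rank evenly among its own neighbours.
            incoming = sum(rank[nb] / len(adjacency[nb]) for nb in adjacency[node])
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


# Toy star-shaped collaboration: "Hub" has published with everyone else.
toy_adjacency = {
    "Hub": {"A", "B", "C"},
    "A": {"Hub"},
    "B": {"Hub"},
    "C": {"Hub"},
}
toy_rank = pagerank_sketch(toy_adjacency)
print(max(toy_rank, key=toy_rank.get))  # Hub
```

In a coauthorship graph, this means an author scores highly either by having many coauthors or by being connected to coauthors who are themselves well connected.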

This is a histogram of the number of coauthors each person has:


In [ ]:
plt.hist(authors['coauthors'], bins=30)
plt.title("Histogram of total coauthors")

This is a histogram of authors' pageranks:


In [ ]:
plt.hist(authors['pagerank'], bins=100)
plt.title("Histogram of author pageranks")

We can also look at the distribution of maximal clique sizes, where a clique is a group of authors in which every pair has published together:


In [ ]:
def clique_hist(cliques):
    clique_sizes = [len(x) for x in cliques]
    plt.hist(clique_sizes)

In [ ]:
cliques = list(nx.find_cliques(G))
clique_hist(cliques)
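
As a small illustration of what `nx.find_cliques` returns, the sketch below uses a made-up four-author graph (the node names and `toy_` variables are hypothetical). The function yields maximal cliques only, so the triangle is reported once rather than as three separate pairs.

```python
import networkx as nx

toy_graph = nx.Graph()
# Three authors who have all published with each other form a triangle.
toy_graph.add_edges_from([("A", "B"), ("B", "C"), ("A", "C")])
# D has only published with C, so {C, D} is a separate maximal clique of size 2.
toy_graph.add_edge("C", "D")

# Sort the members and the cliques so the output is stable across runs.
toy_cliques = sorted(sorted(clique) for clique in nx.find_cliques(toy_graph))
print(toy_cliques)  # [['A', 'B', 'C'], ['C', 'D']]
```

An author can belong to several maximal cliques at once, as C does here, so the clique counts in the histogram can exceed the number of authors.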

And finally, a graph of author connectivity. Node size indicates squared degree (the total number of coauthors, squared) and node color indicates the number of papers. The graph only includes people who authored at least 3 papers, and only people with at least 15 coauthors are labeled.


In [ ]:
graph.draw(G, with_labels=True, n=15, threshold=3)
fig = plt.gcf()
fig.set_figwidth(20)
fig.set_figheight(20)